Introduction

The 2016 election cycle has been one of the most unusual in American history. In this election, politicians experienced in running states and nations are often floundering, while those with little experience are promising a new brand of leadership. The 2016 election is also marked by debates over fundraising. It is the first presidential election since the Supreme Court’s Citizens United decision in which campaigns have taken full advantage of Super PAC funding. Bernie Sanders often boasts of his low average donations; Donald Trump was for a while self-funding his campaign. In this light, I will investigate direct donations from California donors to all candidates, failed or still in the running, and examine the effects of various variables on fundraising.

##    cmte_id               cand_id                           cand_nm     
##  Length:56764       P00003392:15980   Clinton, Hillary Rodham  :15980  
##  Class :character   P60007168:14620   Sanders, Bernard         :14620  
##  Mode  :character   P60005915: 7826   Carson, Benjamin S.      : 7826  
##                     P60006111: 6675   Cruz, Rafael Edward 'Ted': 6675  
##                     P60006723: 3102   Rubio, Marco             : 3102  
##                     P60007242: 2299   Fiorina, Carly           : 2299  
##                     (Other)  : 6262   (Other)                  : 6262  
##               contbr_nm            contbr_city    contbr_st 
##  IMEOKPARIA, OSI   :   52   LOS ANGELES  : 4082   CA:56764  
##  BATTS, ERIC       :   43   SAN FRANCISCO: 3747             
##  BUCCHERE, CHRIS   :   41   SAN DIEGO    : 1887             
##  HANNON, STEPHANIE :   40   SAN JOSE     : 1167             
##  PAULISSIAN, MARTHA:   40   OAKLAND      :  921             
##  OFTEDAL, EGIL MR. :   37   SACRAMENTO   :  775             
##  (Other)           :56511   (Other)      :44185             
##      contbr_zip         contbr_employer              contbr_occupation
##  920274404:   56   RETIRED      :10028   RETIRED              :12059  
##  943061338:   52   NOT EMPLOYED : 5473   NOT EMPLOYED         : 4713  
##  913565823:   46   SELF-EMPLOYED: 4784   ATTORNEY             : 1947  
##  958256321:   44   N/A          : 3035   HOMEMAKER            : 1436  
##  941311708:   43   SELF EMPLOYED: 2129   INFORMATION REQUESTED: 1217  
##  949601336:   41   (Other)      :31233   (Other)              :35380  
##  (Other)  :56482   NA's         :   82   NA's                 :   12  
##  contb_receipt_amt   contb_receipt_dt                     receipt_desc  
##  Min.   :-10000.0   30-SEP-15: 3279                             :55408  
##  1st Qu.:    35.0   30-JUN-15: 1665   Refund                    :  349  
##  Median :   100.0   29-SEP-15: 1305   REDESIGNATION TO GENERAL  :  258  
##  Mean   :   474.1   23-SEP-15:  982   REDESIGNATION FROM PRIMARY:  254  
##  3rd Qu.:   250.0   15-SEP-15:  924   REATTRIBUTION FROM SPOUSE :   94  
##  Max.   : 10800.0   28-SEP-15:  878   REATTRIBUTION TO SPOUSE   :   94  
##                     (Other)  :47731   (Other)                   :  307  
##  memo_cd                                 memo_text      form_tp     
##   :55539                                      :42982   SA17A:56161  
##  X: 1225   * EARMARKED CONTRIBUTION: SEE BELOW:12193   SA18 :  254  
##            EARMARKED FROM MAKE DC LISTEN      :  387   SB28A:  349  
##            REDESIGNATION TO GENERAL           :  258                
##            REDESIGNATION FROM PRIMARY         :  254                
##            REATTRIBUTION FROM SPOUSE          :   94                
##            (Other)                            :  596                
##     file_num              tran_id      election_tp  
##  Min.   :1003942   SA17.300674:    3        :   49  
##  1st Qu.:1024052   SA17.365962:    3   G2016:  951  
##  Median :1029414   C1013462   :    2   O2016:  281  
##  Mean   :1026768   C1015104   :    2   P2016:55482  
##  3rd Qu.:1029462   C1015363   :    2   P2018:    1  
##  Max.   :1029674   C1015437   :    2                
##                    (Other)    :56750
## 'data.frame':    56764 obs. of  18 variables:
##  $ cmte_id          : chr  "C00577130" "C00577130" "C00577130" "C00577130" ...
##  $ cand_id          : Factor w/ 21 levels "P00003392","P20002721",..: 10 10 10 10 10 10 10 10 10 9 ...
##  $ cand_nm          : Factor w/ 21 levels "Bush, Jeb","Carson, Benjamin S.",..: 17 17 17 17 17 17 17 17 17 16 ...
##  $ contbr_nm        : Factor w/ 24895 levels "A DOSS, CANDALON",..: 5436 8076 8120 8262 16036 16049 16113 10769 10799 19940 ...
##  $ contbr_city      : Factor w/ 1022 levels "","29 PALMS",..: 794 39 174 791 613 110 38 791 828 505 ...
##  $ contbr_st        : Factor w/ 1 level "CA": 1 1 1 1 1 1 1 1 1 1 ...
##  $ contbr_zip       : Factor w/ 21215 levels "","00000","11205",..: 14807 19762 5938 7509 16599 5657 4141 7356 18357 2567 ...
##  $ contbr_employer  : Factor w/ 9407 levels ""," APPLE INC.",..: 206 5791 5791 7323 5791 5791 5791 8753 5791 2816 ...
##  $ contbr_occupation: Factor w/ 4464 levels "",".COM EXECUTIVE",..: 3842 2615 2615 3601 2615 2615 3427 2489 2615 3294 ...
##  $ contb_receipt_amt: num  50 100 1000 5 196 ...
##  $ contb_receipt_dt : Factor w/ 274 levels "01-APR-15","01-AUG-15",..: 93 170 170 170 189 170 170 52 52 215 ...
##  $ receipt_desc     : Factor w/ 22 levels "","2016 SENATE PRIMARY DONOR REDESIGNATION FROM PRIMARY",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ memo_cd          : Factor w/ 2 levels "","X": 1 1 1 1 1 1 1 1 1 1 ...
##  $ memo_text        : Factor w/ 88 levels "","*","* EARMARKED CONTRIBUTION: SEE BELOW",..: 3 3 3 3 1 3 3 3 3 1 ...
##  $ form_tp          : Factor w/ 3 levels "SA17A","SA18",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ file_num         : int  1029414 1029414 1029414 1029414 1029414 1029414 1029414 1029414 1029414 1029436 ...
##  $ tran_id          : Factor w/ 56511 levels "A000771210424405B8CF",..: 47278 48223 48147 48200 50425 48204 48202 46887 46898 37882 ...
##  $ election_tp      : Factor w/ 5 levels "","G2016","O2016",..: 4 4 4 4 4 4 4 4 4 4 ...
##  [1] "Bush, Jeb"                 "Carson, Benjamin S."      
##  [3] "Christie, Christopher J."  "Clinton, Hillary Rodham"  
##  [5] "Cruz, Rafael Edward 'Ted'" "Fiorina, Carly"           
##  [7] "Graham, Lindsey O."        "Huckabee, Mike"           
##  [9] "Jindal, Bobby"             "Kasich, John R."          
## [11] "Lessig, Lawrence"          "O'Malley, Martin Joseph"  
## [13] "Pataki, George E."         "Paul, Rand"               
## [15] "Perry, James R. (Rick)"    "Rubio, Marco"             
## [17] "Sanders, Bernard"          "Santorum, Richard J."     
## [19] "Trump, Donald J."          "Walker, Scott"            
## [21] "Webb, James Henry Jr."

The field is currently fairly crowded, with 21 candidates across both parties. By number of donations, though, most of them aren’t competing at all:

The first thing that I noticed about the graph was that the names were too long to fit nicely. I decided to shorten each name to the candidate’s last name so that names are more easily visible in graphs:
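
The code for this step is not shown. A minimal base-R sketch, assuming names follow the "Last, First" pattern seen in the cand_nm column (the sample vector and the `last_name` variable are hypothetical), might look like this:

```r
# Hypothetical sample of candidate names in the "Last, First" format
# that appears in the cand_nm column.
cand_nm <- c("Clinton, Hillary Rodham",
             "Cruz, Rafael Edward 'Ted'",
             "O'Malley, Martin Joseph",
             "Perry, James R. (Rick)")

# Remove the first comma and everything after it, keeping the last name.
last_name <- sub(",.*$", "", cand_nm)

print(last_name)
```

Because the pattern starts at the first comma, suffixes and nicknames after it are dropped along with the first name.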

Clearly, the two major Democratic candidates (Sanders and Clinton) are leading by a long shot in terms of number of contributions. This makes sense, as California is a heavily Democratic state. I decided to add political party information into the data.
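
One way to attach party information is a simple lookup on the last name. This is only a sketch: the `democrats` vector and `party_of` helper are hypothetical names, the rows are toy data, and `prty` mirrors the column name that appears in later output.

```r
# The five Democratic candidates in this dataset; every other candidate
# listed ran as a Republican. `democrats` and `party_of` are hypothetical
# names used only for this sketch.
democrats <- c("Clinton", "Sanders", "O'Malley", "Lessig", "Webb")

party_of <- function(last_name) {
  factor(ifelse(last_name %in% democrats, "D", "R"), levels = c("D", "R"))
}

# Toy rows standing in for the real contribution data.
donations <- data.frame(cand = c("Clinton", "Cruz", "Sanders", "Rubio"),
                        stringsAsFactors = FALSE)
donations$prty <- party_of(donations$cand)

print(donations)
```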

After adding political party affiliation, I decided to also include the current polling numbers (in California) of the candidates. This represents the percentage of voters in each candidate’s party who say they would vote for that candidate.

Of course, it isn’t a good idea to compare the polling of candidates of different parties, as they aren’t polling against each other at this point. Some candidates have 0 support in polls despite having received donations, often because they recently dropped out.

I then decided to create a scatterplot matrix to see what I should investigate next.

Because of the absence of much quantitative data, the scatterplot matrix did not prove particularly insightful. However, I was intrigued by the monetary value of donations, as well as the different cities and towns contributing. First, I decided to investigate the monetary value of donations.

I realized that some contributions were negative, usually because they were “Redesignations”, “Reattributions”, or “Refunds”. These were usually paired with a positive entry with the same description and the same absolute value. I removed the negative entries for the sake of graphing. When I did, I noticed that the data was highly skewed, with most donations under $1,000 but some at almost $3,000. I applied a log10 scale and saw that, not surprisingly, people tend to give regular, round amounts.
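
The filtering and log10 transformation described here can be sketched in base R. The amounts below are toy values, not figures from the dataset, and the variable names are hypothetical.

```r
# Toy contribution amounts, including a refund recorded as a negative value.
amt <- c(25, 50, 100, 100, 250, 1000, 2700, -100, 100, 50)

# Keep only positive receipts for plotting.
positive <- amt[amt > 0]

# On a log10 scale, 10, 100 and 1000 are evenly spaced, which spreads out
# the many small donations that a linear axis would crowd together.
log_amt <- log10(positive)

# Tabulating the raw amounts shows how donations cluster at round values.
common <- sort(table(positive), decreasing = TRUE)

print(round(log_amt, 2))
print(head(common, 3))
```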

I immediately noticed that although Sanders is close to Clinton in terms of number of donations, he isn’t in terms of amount raised.

Looking at these two graphs side by side, Cruz and Carson also seem to rank lower by amount raised than by number of donations. I decided to compare these two variables (number of donations vs total donation size).
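
A comparison of this kind needs one row per candidate with a donation count and a total, plus a trendline to compare against. The following is a sketch on toy data (candidates A to D and their amounts are hypothetical), using a plain linear fit as an assumed stand-in for whatever smoother the plot used:

```r
# Toy data: one row per donation (hypothetical candidates and amounts).
donations <- data.frame(
  cand = rep(c("A", "B", "C", "D"), times = c(2, 4, 6, 8)),
  amt  = rep(c(100, 500, 50, 300), times = c(2, 4, 6, 8))
)

# Number of donations and total raised per candidate.
n     <- tapply(donations$amt, donations$cand, length)
total <- tapply(donations$amt, donations$cand, sum)
by_cand <- data.frame(cand  = names(n),
                      n     = as.vector(n),
                      total = as.vector(total),
                      stringsAsFactors = FALSE)
by_cand$mean <- by_cand$total / by_cand$n

# Linear trend of total raised against number of donations.
fit <- lm(total ~ n, data = by_cand)

# A positive residual means the candidate raised more than the trend predicts.
by_cand$above_trend <- as.vector(residuals(fit) > 0)

print(by_cand)
```

Candidates flagged `above_trend` are the ones with a higher average donation than the field as a whole would suggest.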

Generally, there is a clear correlation between number of donors and total donation amount. However, some candidates raise more money than would be expected by the number of donations, others less. An interesting observation that I made was that candidates lying above the trendline on this graph are generally considered “establishment” candidates and have held high-level offices in government, generally having large amounts of experience. Those lying below include some establishment candidates but also three “upstart” senators (Paul, Cruz, and Sanders) and three candidates who have never held elected office (Trump, Lessig, and Carson). This seems to hold true for both parties. This observation makes a good amount of sense: establishment candidates tend to have the better-funded experienced political forces donating to their campaigns, while candidates attempting to disrupt will be less likely to. Another observation that I made was that candidates below the margin of error seemed to be doing better in polls than those above on average (except Clinton). I made another graph to investigate:

It would seem that this election, Republicans with higher average donations generally have lower support in polls. Of course, there is no way to know given this data why this is, but I hypothesize that it may have something to do with the anti-establishment fervor this election. This trend does not seem to hold true for Democrats, although of course there are only three datapoints.

## 
##  Pearson's product-moment correlation
## 
## data:  subset(summarised, prty = "R")$mean and subset(summarised, prty = "R")$polls
## t = -1.9287, df = 19, p-value = 0.06885
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.71195723  0.03278233
## sample estimates:
##        cor 
## -0.4046307
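
The `data:` line in the output above shows `subset(summarised, prty = "R")`. With a single `=`, `subset()` receives `prty` as an extra named argument rather than a condition, so no rows are filtered; the reported df of 19 is consistent with all 21 candidates being included. The sketch below uses toy values (the numbers are hypothetical) to show the difference:

```r
# Toy per-candidate summary (hypothetical values); `prty` mirrors the
# party column used in the call shown in the output.
summarised <- data.frame(
  prty  = c("R", "R", "R", "D", "D"),
  mean  = c(900, 400, 150, 850, 150),
  polls = c(5, 12, 25, 50, 35),
  stringsAsFactors = FALSE
)

# `prty = "R"` is an unused named argument, so every row is kept.
# Some R versions also warn that the extra argument is disregarded.
unfiltered <- suppressWarnings(subset(summarised, prty = "R"))

# `prty == "R"` is a logical condition and keeps Republicans only.
republicans <- subset(summarised, prty == "R")

print(nrow(unfiltered))
print(nrow(republicans))

# Correlation restricted to the Republican rows.
print(cor(republicans$mean, republicans$polls))
```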

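The correlation appears to be negative as noted, though it is not statistically significant at the 5% level (p = 0.069). Note that the call shown passes `prty = "R"` rather than `prty == "R"`, so the subset was not actually filtered: with df = 19, the test covers all 21 candidates rather than Republicans alone. Next, I decided to look at how much each candidate raised per percent support in the polls.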

Clearly, Bush is raising a huge amount of money for the small support he has in the polls. Clinton is also raising proportionally more. Sanders, Carson, and Christie are all raising less relative to their support, and Donald Trump is raising almost nothing despite his high support. Next, I turned my attention to regional differences.

Using ggmap and taRifx.geo, a familiar story presents itself. Support for Democrats is strongest in urban areas such as Los Angeles and the San Francisco Bay Area. Support for Republicans comes from more rural areas. However, since donors in urban areas far outnumber those in rural areas, Democrats have more support overall.

I took a look at the specific Democrats (of which only Sanders and Clinton have any support). Clinton has a definite edge in Los Angeles, Silicon Valley, and San Francisco. Sanders is leading slightly in places like Oakland, Sacramento, and San Jose. Next, I decided to look at the Republicans with the most donations (Cruz, Carson, and Rubio).

Interestingly, Rubio seems to be doing best in the same places that Clinton was, such as Silicon Valley and Los Angeles, while Carson seems to be doing best in more suburban or rural areas as well as Oakland. Cruz seems to lead mostly in Fresno and Bakersfield, though he also has footholds in San Jose and Los Angeles.

I knew that most of the donations were coming from the cities, but is that where the money is coming from too? I tried to create an effective visualization, trying both size and alpha, but neither seemed to provide any relevant insights. So I turned my attention to donor occupations.

## ed_subset$contbr_occupation
##                              .COM EXECUTIVE             (RETIRED) 
##              1948.754               250.000                50.000 
##                     0 100% DISABLED VETERAN    19327 CITRONIA ST. 
##              1000.000                29.480                50.000
## [1] SR. GEOTYPICAL PRODUCTION ASSISTANT NOT EMPLOYED                       
## [3] SELF                                RETIRED                            
## [5] MILITARY OFFICER                    REAL ESTATE INVESTOR               
## 4464 Levels:  .COM EXECUTIVE (RETIRED) 0 ... YOUTUBER
## Source: local data frame [5,026 x 4]
## Groups: contbr_occupation [4446]
## 
##                         contbr_occupation   prty     mean     n
##                                    (fctr) (fctr)    (dbl) (int)
## 1                                 RETIRED      R 223.6667  8846
## 2                            NOT EMPLOYED      D 167.6823  4702
## 3                                 RETIRED      D 397.9018  2945
## 4                                ATTORNEY      D 909.3816  1502
## 5  INFORMATION REQUESTED PER BEST EFFORTS      R 400.9745  1125
## 6                   INFORMATION REQUESTED      D 389.7675   964
## 7                               HOMEMAKER      R 936.0519   870
## 8                               PHYSICIAN      D 475.5738   566
## 9                                 TEACHER      D 208.3525   541
## 10                             CONSULTANT      D 715.1457   486
## ..                                    ...    ...      ...   ...

I immediately noticed that some professions had many more donations going to one party than the other, even among occupations with thousands of donations. For example, far more donations from retirees went to Republicans than to Democrats. I combined this into a new summary.

## Source: local data frame [4,446 x 4]
## 
##                         contbr_occupation percent_d     n      mean
##                                    (fctr)     (dbl) (int)     (dbl)
## 1                                 RETIRED 0.2497668 11791  621.5685
## 2                            NOT EMPLOYED 0.9978778  4712  227.2483
## 3                                ATTORNEY 0.7810712  1923 1800.2590
## 4                               HOMEMAKER 0.3458647  1330 2220.5844
## 5                   INFORMATION REQUESTED 0.7966942  1210 1121.3328
## 6  INFORMATION REQUESTED PER BEST EFFORTS 1.0000000  1125  400.9745
## 7                               PHYSICIAN 0.6520737   868  840.1917
## 8                                 TEACHER 0.7534819   718  340.8725
## 9                                ENGINEER 0.4949928   699  592.2246
## 10                                    CEO 0.6298422   697 2673.5223
## ..                                    ...       ...   ...       ...
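
A summary with these columns can be computed directly from individual donations. This is a base-R sketch on toy rows (the values are hypothetical); it takes `percent_d` as the share of an occupation's donations that went to Democrats and `mean` as the average over all of that occupation's donations.

```r
# Toy rows standing in for the contribution data (hypothetical values).
donations <- data.frame(
  contbr_occupation = c("RETIRED", "RETIRED", "RETIRED", "RETIRED",
                        "TEACHER", "TEACHER", "ATTORNEY", "ATTORNEY"),
  prty = c("R", "R", "R", "D", "D", "D", "D", "R"),
  contb_receipt_amt = c(100, 200, 300, 400, 50, 150, 1000, 2000),
  stringsAsFactors = FALSE
)

is_dem <- donations$prty == "D"
occ    <- donations$contbr_occupation

# Share of donations going to Democrats, count, and overall mean amount.
pct <- tapply(is_dem, occ, mean)
summary_by_occ <- data.frame(
  contbr_occupation = names(pct),
  percent_d = as.vector(pct),
  n         = as.vector(tapply(is_dem, occ, length)),
  mean      = as.vector(tapply(donations$contb_receipt_amt, occ, mean)),
  stringsAsFactors = FALSE
)

# Occupations with the most donations first.
summary_by_occ <- summary_by_occ[order(-summary_by_occ$n), ]

print(summary_by_occ)
```

Working from the individual rows avoids combining per-party averages, which would not give the overall mean unless both parties had the same number of donations.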

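Some stark differences can be seen. About 75% of donations from retirees went to Republicans, while people who are not employed donated overwhelmingly (99.8%) to Democrats. These two in particular make sense, as older people are typically more conservative and Democrats tend to support more welfare for the unemployed. Two caveats apply to this table. The mean column equals the sum of the per-party means from the previous summary (for RETIRED, 223.67 + 397.90 = 621.57) rather than the overall average donation. Also, INFORMATION REQUESTED PER BEST EFFORTS shows a percent_d of 1 even though all 1,125 of its donations appear under Republicans in the previous summary. As for who donated the most: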

The results are not surprising: attorneys, CEOs, and presidents donate the most, which makes sense because they are paid the most. On the other side, the unemployed and teachers donate relatively little, which again corresponds to their relatively low (or nonexistent) incomes. I then wanted to see whether professions with more Democrats donated less or more.

## 
##  Pearson's product-moment correlation
## 
## data:  percentagessummary$percent_d and percentagessummary$mean
## t = -13.41, df = 4444, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2253038 -0.1687969
## sample estimates:
##        cor 
## -0.1972142

There is definitely a negative correlation, although a weak one (r = -0.20). Donors in more Democratic occupations give less on average than those in more Republican occupations.

## Source: local data frame [9,652 x 4]
## Groups: contbr_employer [9337]
## 
##                           contbr_employer   prty     mean     n
##                                    (fctr) (fctr)    (dbl) (int)
## 1                                 RETIRED      R 223.2021  8704
## 2                            NOT EMPLOYED      D 163.9095  5454
## 3                           SELF-EMPLOYED      D 835.4545  3190
## 4                                     N/A      D 715.9710  2769
## 5                           SELF-EMPLOYED      R 632.7118  1489
## 6                           SELF EMPLOYED      D 150.2305  1263
## 7  INFORMATION REQUESTED PER BEST EFFORTS      R 413.7171  1237
## 8                                 RETIRED      D 388.8897  1101
## 9                                    SELF      R 250.2825   949
## 10                  INFORMATION REQUESTED      D 378.7080   927
## ..                                    ...    ...      ...   ...

I noticed that the top 17 observations weren’t actually organizations or companies employing anyone. So I cut off these values.

Clearly, almost all of these employers are strongly Democratic. Employees of the Creative Artists Agency donated exclusively to Democrats. So did employees of Hillary Clinton’s political campaign. Google employees donated overwhelmingly to Democrats too. The most Republican employer was Kaiser Permanente, which still had over 60% of its donations going to Democrats. For some reason, “Unemployed” is very Republican when given as an employer, despite being very Democratic when given as an occupation. I really don’t know why this is, and because there are many ways to specify unemployment (e.g. “Not Employed”, “None”), this seems to be an outlier. Looking at the very Democratic donations made by these employees, I wondered what the breakdown was within Democrats.

Employees of Creative Artists Agency and Hillary For America both donated exclusively to Clinton. Los Angeles County and UCLA employees donated more to Sanders. United Airlines employees donated overwhelmingly to Sanders, while Wells Fargo employees donated much more to Clinton. One surprising result was that at both Google and Apple, about 10% of donations went to Lawrence Lessig, a candidate who got almost no support otherwise. Again, I wanted to see how much employees of these organizations were donating.

It is clear that Facebook has extraordinarily high donation amounts, as do Stanford, Creative Artists Agency, and Google. Employees of the Clinton campaign, the unemployed, and United Airlines employees donate much less. This seems to reflect, at least roughly, the salaries that employees of these organizations make.

I decided that now, after doing some exploration, I would create a regression attempting to estimate a donor’s donation size.

## [1] "Zip code only r^2:"
## [1] 0.2244755
## [1] "Zip code and candidate r^2:"
## [1] 0.333761
## [1] "Zip code, candidate, and occupation r^2:"
## [1] 0.4889345
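
The modeling code is not shown. Nested linear models like the ones whose r^2 values are printed above could be fit along these lines; this is a sketch on toy data, and the column names `zip`, `cand`, and `amt` as well as all values are hypothetical. It also prints adjusted R-squared, an assumed addition that penalizes models with many factor levels.

```r
# Toy donations (hypothetical zip codes, candidates and amounts).
donations <- data.frame(
  zip  = rep(c("Z1", "Z2", "Z3"), each = 3),
  cand = rep(c("Clinton", "Sanders", "Cruz"), times = 3),
  amt  = c(1000, 100, 250, 500, 50, 200, 300, 25, 150)
)

# Each categorical predictor is expanded into indicator variables by lm().
fit_zip  <- lm(amt ~ zip, data = donations)
fit_both <- lm(amt ~ zip + cand, data = donations)

r2_zip   <- summary(fit_zip)$r.squared
r2_both  <- summary(fit_both)$r.squared
adj_both <- summary(fit_both)$adj.r.squared

print(c(zip_only = r2_zip, zip_and_candidate = r2_both, adjusted = adj_both))
```

Plain R-squared can only rise as predictors are added, so a gap between it and the adjusted value is one hint that sparse categories are inflating the fit.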

With an r^2 value of 0.489, the final model seems fairly effective given that it uses only three variables to predict a very complicated event. However, because many of the occupations included contain only a few people, the model seems unlikely to sustain this value on new data. I then decided to see if I could predict party affiliation using zip code and occupation.

## [1] "Just zip code r^2:"
## [1] 0.3024183
## [1] "Zip code and occupation:"
## [1] 0.590678
## [1] "Baseline accuracy: "
## [1] 0.5605317
## [1] "Accuracy of Model: "
## [1] 0.7758924
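
The baseline and model accuracies printed above compare a majority-class guess with thresholded model predictions. The sketch below shows that comparison on toy data. It assumes a linear model on a 0/1 party indicator with a 0.5 cutoff, which may differ from the model actually used, and the zip labels and rows are hypothetical.

```r
# Toy donors (hypothetical zip labels and parties).
donations <- data.frame(
  zip  = rep(c("Z1", "Z2", "Z3"), times = c(4, 4, 2)),
  prty = c("D", "D", "D", "R",  "R", "R", "R", "D",  "D", "D"),
  stringsAsFactors = FALSE
)
donations$is_dem <- as.numeric(donations$prty == "D")

# Baseline: always predict the more common party.
baseline <- max(table(donations$prty)) / nrow(donations)

# Linear probability model; fitted values above 0.5 are classed as "D".
fit <- lm(is_dem ~ zip, data = donations)
predicted <- ifelse(fitted(fit) > 0.5, "D", "R")

# In-sample accuracy: share of donors whose party is predicted correctly.
accuracy <- mean(predicted == donations$prty)

print(c(baseline = baseline, model = accuracy))
```

Both figures here are measured on the same rows used for fitting, so accuracy on new data would likely be lower.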

Zip code alone gives an R-squared of 0.302 for predicting which party an individual donates to; adding occupation raises it to 0.591. The combined model predicts the party with 77.6% accuracy. For comparison, assuming that every donor donates to Democrats gives 56.1% accuracy. This jump in accuracy is substantial, and supports the earlier observations that party affiliation is strongly tied to region and occupation. The two models that I built seemed adequate for the circumstances, but I wanted to see how well they would hold up with newly collected data. I downloaded a new dataset from the FEC, but unfortunately the data had not been updated since I first downloaded it. At this point, I was ready to conclude this investigation.

Final Plots and Summary

Some candidates have many supporters but raise little money.

##          n
## 1 2.568504

This graph revealed the discrepancies between number of supporters and donation amounts. Some candidates with many donors raised relatively little, and vice versa. For example, Bernie Sanders had about 92% as many donations as Hillary Clinton (14,515 vs 15,790) but only raised about 16% of what she did ($2,154,745 vs $13,575,640). Similarly, Ted Cruz raised about 51% of what Marco Rubio did despite having roughly 2.57 times the donations.
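
The Sanders and Clinton percentages follow directly from the counts and totals quoted in this paragraph. A short check (variable names are hypothetical):

```r
# Donation counts and totals quoted in the text.
sanders_n     <- 14515
clinton_n     <- 15790
sanders_total <- 2154745
clinton_total <- 13575640

# Sanders relative to Clinton, by count and by amount raised.
donation_ratio <- sanders_n / clinton_n
amount_ratio   <- sanders_total / clinton_total

# Mean donation per candidate.
sanders_mean <- sanders_total / sanders_n
clinton_mean <- clinton_total / clinton_n

print(round(100 * donation_ratio))  # 92
print(round(100 * amount_ratio))    # 16
print(round(sanders_mean))          # 148
print(round(clinton_mean))          # 860
```

The gap between the two ratios comes entirely from the difference in mean donation size.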

Establishment candidates have higher average donation amounts

This graph revealed that the average donation size is heavily influenced by whether candidates are considered “establishment” or not. Candidates who have lower average donations (and thus fall under the trendline) tend to consider themselves “outsider” or “anti-establishment” candidates, while those with higher average donations (falling above the trendline) tend to be more “insider” or “establishment” candidates. I hypothesized that this could be due to establishment candidates having deeper connections with bigger donors, while anti-establishment candidates tend to distance themselves from large donors. Some examples of establishment candidates above the trendline are Hillary Clinton (mean donation: $860), Jeb Bush ($1601), and Marco Rubio ($825). Many candidates below the trendline had radically different numbers, such as Bernie Sanders ($148), Ben Carson ($160), and Ted Cruz ($162).

Democrats are Concentrated in the Cities

This map showed the distribution of Democratic and Republican donors across California. Cities like San Francisco, Oakland, and Los Angeles have mostly Democratic donors, while more rural or suburban areas have more Republican donors. This illustrates the well-known pattern that rural voters tend to be more conservative and urban voters more liberal.

Reflection

Overall, this was a very interesting and revealing investigation. It supported many of the ideas that I had about the current election but had never checked. For example, I learned that political “outsiders” really do have smaller donations than insiders and that there really are more Democratic donors in the cities. I also learned new information, such as the negative correlation between average donation amount and polling averages in the Republican party.

I thought that a strength of my investigation was the new data that I introduced (such as polling averages) and how it improved my overall understanding. A major weakness of my investigation was the lack of data on Super PACs. Super PAC data is not included in the FEC contribution file used here, but it still makes up a large share of the monetary value of donations (if not of the number of donors).

Some ideas for expansions of this project:
Find a way to include Super PAC data.
Determine the gender of donors based on their names and investigate it as a variable.
Investigate employer as a variable.

I’ve learned a lot from doing this exploration and I hope to continue to learn more throughout the Nanodegree.

Citations:

Libraries: H. Wickham. ggplot2: elegant graphics for data analysis. Springer New York, 2009.

Hadley Wickham (2015). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.0.0. http://CRAN.R-project.org/package=stringr

Hadley Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1-29. URL http://www.jstatsoft.org/v40/i01/.

Hadley Wickham and Romain Francois (2015). dplyr: A Grammar of Data Manipulation. R package version 0.4.3. http://CRAN.R-project.org/package=dplyr

Barret Schloerke, Jason Crowley, Di Cook, Heike Hofmann, Hadley Wickham, Francois Briatte, Moritz Marbach and Edwin Thoen (2014). GGally: Extension to ggplot2. R package version 0.5.0. http://CRAN.R-project.org/package=GGally

Baptiste Auguie (2015). gridExtra: Miscellaneous Functions for “Grid” Graphics. R package version 2.0.0. http://CRAN.R-project.org/package=gridExtra

Ari B. Friedman (2014). taRifx.geo: Collection of various spatial functions. R package version 1.0.6. http://CRAN.R-project.org/package=taRifx.geo

D. Kahle and H. Wickham. ggmap: Spatial Visualization with ggplot2. The R Journal, 5(1), 144-161. URL http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf

Martin Elff (2015). memisc: Tools for Management of Survey Data, Graphics, Programming, Statistics, and Simulation. R package version 0.97. http://CRAN.R-project.org/package=memisc

Stack Overflow Posts:

https://stackoverflow.com/questions/4350440/split-a-column-of-a-data-frame-to-multiple-columns

https://stackoverflow.com/questions/12910218/set-specific-fill-colors-in-ggplot2-by-sign

https://stackoverflow.com/questions/10128617/geocodes-if-characters-in-string-in-r

https://stackoverflow.com/questions/15624656/labeling-points-in-geom-point-graph-in-ggplot2

https://stats.stackexchange.com/questions/18233/using-predict-function-in-r

https://stackoverflow.com/questions/1296646/how-to–a-dataframe-by-columns

Polling Data:

http://elections.huffingtonpost.com/pollster/2016-california-republican-presidential-primary#!partisanship=N&estimate=custom

http://elections.huffingtonpost.com/pollster/2016-california-democratic-presidential-primary#!maxdate=2015-11-21&partisanship=N&estimate=custom

Other: http://www.geo.ut.ee/aasa/LOOM02331/heatmap_in_R.html